{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"images/dask_horizontal.svg\" align=\"right\" width=\"30%\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "\n",
    "Welcome to the Dask Tutorial.\n",
    "\n",
    "Dask is a parallel computing library that scales the existing Python ecosystem. This tutorial will introduce Dask and parallel data analysis more generally.\n",
    "\n",
    "Dask can scale down to your laptop and up to a cluster. Here, we'll use an environment you setup on your laptop to analyze medium sized datasets in parallel locally."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dask provides multi-core and distributed parallel execution on larger-than-memory datasets.\n",
    "\n",
    "We can think of Dask at a high and a low level\n",
    "\n",
    "*  **High level collections:**  Dask provides high-level Array, Bag, and DataFrame\n",
    "   collections that mimic NumPy, lists, and Pandas but can operate in parallel on\n",
    "   datasets that don't fit into memory.  Dask's high-level collections are\n",
    "   alternatives to NumPy and Pandas for large datasets.\n",
    "*  **Low Level schedulers:** Dask provides dynamic task schedulers that\n",
    "   execute task graphs in parallel.  These execution engines power the\n",
    "   high-level collections mentioned above but can also power custom,\n",
    "   user-defined workloads.  These schedulers are low-latency (around 1ms) and\n",
    "   work hard to run computations in a small memory footprint.  Dask's\n",
    "   schedulers are an alternative to direct use of `threading` or\n",
    "   `multiprocessing` libraries in complex cases or other task scheduling\n",
    "   systems like `Luigi` or `IPython parallel`.\n",
    "\n",
    "Different users operate at different levels but it is useful to understand\n",
    "both.\n",
    "\n",
    "The Dask [use cases](https://stories.dask.org/en/latest/) provides a number of sample workflows where Dask should be a good fit."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should clone this repository: \n",
    "\n",
    "    git clone http://github.com/dask/dask-tutorial\n",
    "\n",
    "The included file `environment.yml` in the `binder` subdirectory contains a list of all of the packages needed to run this tutorial. To install them using `conda`, you can do\n",
    "\n",
    "    conda env create -f binder/environment.yml\n",
    "    conda activate dask-tutorial\n",
    "    jupyter labextension install @jupyter-widgets/jupyterlab-manager\n",
    "    jupyter labextension install @bokeh/jupyter_bokeh\n",
    "    \n",
    "Do this *before* running this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Links"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*  Reference\n",
    "    *  [Docs](https://dask.org/)\n",
    "    *  [Examples](https://examples.dask.org/)\n",
    "    *  [Code](https://github.com/dask/dask/)\n",
    "    *  [Blog](https://blog.dask.org/)\n",
    "*  Ask for help\n",
    "    *   [`dask`](http://stackoverflow.com/questions/tagged/dask) tag on Stack Overflow, for usage questions\n",
    "    *   [github issues](https://github.com/dask/dask/issues/new) for bug reports and feature requests\n",
    "    *   [gitter chat](https://gitter.im/dask/dask) for general, non-bug, discussion\n",
    "    *   Attend a live tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tutorial Structure\n",
    "\n",
    "Each section is a Jupyter notebook. There's a mixture of text, code, and exercises.\n",
    "\n",
    "If you haven't used Jupyterlab, it's similar to the Jupyter Notebook. If you haven't used the Notebook, the quick intro is\n",
    "\n",
    "1. There are two modes: command and edit\n",
    "2. From command mode, press `Enter` to edit a cell (like this markdown cell)\n",
    "3. From edit mode, press `Esc` to change to command mode\n",
    "4. Press `shift+enter` to execute a cell and move to the next cell.\n",
    "\n",
    "The toolbar has commands for executing, converting, and creating cells.\n",
    "\n",
    "The layout of the tutorial will be as follows:\n",
    "- Foundations: an explanation of what Dask is, how it works, and how to use lower-level primitives to set up computations. Casual users may wish to skip this section, although we consider it useful knowledge for all users.\n",
    "- Distributed: information on running Dask on the distributed scheduler, which enables scale-up to distributed settings and enhanced monitoring of task operations. The distributed scheduler is now generally the recommended engine for executing task work, even on single workstations or laptops.\n",
    "- Collections: convenient abstractions giving a familiar feel to big data\n",
    "    - bag: Python iterators with a functional paradigm, such as found in func/iter-tools and toolz - generalize lists/generators to big data; this will seem very familiar to users of PySpark's [RDD](http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.RDD)\n",
    "    - array: massive multi-dimensional numerical data, with Numpy functionality\n",
    "    - dataframes: massive tabular data, with Pandas functionality\n",
    "    \n",
    "Whereas there is a wealth of information in the documentation, linked above, here we aim to give practical advice to aid your understanding and application of Dask in everyday situations. This means that you should not expect every feature of Dask to be covered, but the examples hopefully are similar to the kinds of work-flows that you have in mind.\n",
    "\n",
    "## Exercise: Print `Hello, world!`\n",
    "Each notebook will have exercises for you to solve. You'll be given a blank or partially completed cell, followed by a hidden cell with a solution. For example.\n",
    "\n",
    "\n",
    "Print the text \"Hello, world!\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next cell has the solution. Click the ellipses to expand the solution, and always make sure to run the solution cell,\n",
    "in case later sections of the notebook depend on the output from the solution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "print(\"Hello, world!\")"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
